MARDI Workshop

Transfer Functions

Gavin Simpson

Aarhus University

2024-12-05

Transfer functions I

In the 1970s and early 1980s there was a great deal of concern about acid lakes and rivers in northern Europe

Driven mainly by losses of salmon in Scandinavian rivers, this was a major political hot potato

A vast amount of money was expended to determine the cause of the acidification — was it due to acid emissions from power stations or some other cause?

Palaeolimnological data provided conclusive proof that acid deposition was the cause

In Europe, the Surface Waters Acidification Project (SWAP) was a major contributor to the debate

Transfer functions I

Diatoms collected from 167 lakes across UK, Norway, Sweden and associated water chemistry

Can we predict lake-water pH from the diatom species assemblages?

Apply to diatoms counted from a sediment core from the Round Loch of Glenhead (RLGH) covering most of the Holocene

Transfer functions II

Sea surface temperatures are related to global air temperatures

An important arm of palaeoceanography is involved in reconstructing past climates from various proxies

These past climates tell us how the world responded to previous climatic shifts and provide targets for climate modellers to try to model

The data set here is the Imbrie & Kipp data set — the data set that started it all! 61 core-top samples from ocean cores, mainly from the Atlantic

27 species of planktonic foraminifera were identified in the core-top samples

Summer and Winter sea surface temperatures (SST) and sea water salinity values measured at each of the 61 core locations

Applied to reconstruct SST and salinity for 110 samples from Core V12-122 from the Caribbean

Transfer functions

Aim: Predict the environment from observations on species & environment

Transfer functions

Calibration

Bioindication

Inverse of constrained ordination

ter Braak (1995) Chemometrics and Intelligent Laboratory Systems 28: 165–180

Transfer functions

Matrix of species abundances \(\mathbf{Y}\)

Vector of observations of an environmental variable \(\mathbf{x}\)

Assume \(\mathbf{Y}\) is some function \(f\) of the environment plus an error term

\[ \mathbf{Y} = f(\mathbf{x}) + \varepsilon \]

In the classical approach \(f\) is estimated via regression of \(\mathbf{Y}\) on \(\mathbf{x}\)

Then invert \(f\), \(f^{-1}\), to yield an estimate of environment \(\mathbf{x_0}\) from fossil species assemblage \(\mathbf{y_0}\)

\[ \mathbf{\hat{x}_0} = f^{-1}(\mathbf{y_0}) \]

In all but the simplest cases \(f^{-1}\) doesn’t exist and must be estimated via optimisation

Transfer functions

To avoid the problems of inverting \(f\), the indirect approach directly estimates the inverse of \(f\), here \(g\), from the data by regressing \(\mathbf{x}\) on \(\mathbf{Y}\)

\[ \mathbf{x} = g(\mathbf{Y}) + \varepsilon \]

We do not believe that the species influence their environment!

This is just a trick to avoid having to estimate \(f\)

The predicted environment for a fossil sample \(\mathbf{y_0}\) is

\[\mathbf{\hat{x}_0} = g(\mathbf{y_0})\]
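The two approaches can be contrasted in a toy one-species, linear-response setting (entirely made-up numbers; real transfer functions are multivariate and non-linear):

```r
## Toy contrast of classical vs inverse calibration with one species
## responding linearly to pH (hypothetical data, for illustration only)
set.seed(42)
x <- seq(4, 7, length.out = 30)       # training gradient (pH)
y <- 10 + 5 * x + rnorm(30, sd = 1)   # abundance of a single species

## Classical: estimate f by regressing y on x, then invert the fitted line
cls <- lm(y ~ x)
y0  <- 35                                            # a "fossil" abundance
x0_classical <- (y0 - coef(cls)[1]) / coef(cls)[2]   # algebraic inversion

## Inverse: estimate g by regressing x on y -- no inversion required
inv <- lm(x ~ y)
x0_inverse <- predict(inv, newdata = data.frame(y = y0))
```

With a strong signal and little noise the two estimates agree closely; they diverge as noise grows, which is one reason the two approaches behave differently towards the ends of the gradient.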

TF Assumptions

Taxa in the training set are systematically related to the environment in which they live

Environmental variable to be reconstructed is, or is linearly related to, an ecologically important variable in the ecosystem

Taxa in the training set are the same as in the fossil data and their ecological responses have not changed significantly over the timespan represented by the fossil assemblages

Mathematical methods used in regression and calibration adequately model the biological responses to the environment

Other environmental variables have negligible influence, or their joint distribution with the environmental variable of interest is the same as in the training set

In model evaluation by cross-validation, the test data are independent of the training data — the secret assumption until Telford & Birks (2005)

TF Methods

There are quite a few ways of estimating \(f\)

  • Weighted Averaging (WA, WA-PLS)
  • Modern Analogues (MAT)
  • Gaussian Logistic Regression / Maximum Likelihood (GLMs)

A large number of potential techniques from machine learning and bioinformatics have yet to be investigated

Weighted averaging

Species don’t respond in simple ways to environmental gradients

The maximum likelihood method fits Gaussian curves to each species; numerical optimisation is then used to predict for fossil samples

Computationally very intensive, especially when doing cross-validation

Weighted averaging is an approximation to this maximum likelihood approach

Weighted averaging

A very simple idea

In a lake with a certain pH, species with pH optima close to that pH will tend to be the most abundant species present

A simple estimate of a species’ pH optimum is the average of the pH values of the lakes in which that species occurs, weighted by its abundance in those lakes

An estimate of a lake’s pH is the weighted average of the pH optima of all the species present, again weighted by species abundance
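These two averaging steps take only a few lines of base R; the numbers here are invented purely for illustration:

```r
## Minimal weighted-averaging sketch on a tiny made-up training set:
## rows = lakes, cols = species, plus a pH value per lake
Y  <- matrix(c(10, 0,  5,
                2, 8,  0,
                0, 6, 10), nrow = 3, byrow = TRUE)
pH <- c(5.0, 6.0, 7.0)

## Step 1: species pH optima = abundance-weighted average of lake pH
optima <- colSums(Y * pH) / colSums(Y)

## Step 2: raw WA estimate for a lake = abundance-weighted average of optima
wa_est <- as.vector(Y %*% optima / rowSums(Y))
```

Note that `wa_est` spans a narrower range than the observed pH — this shrinkage from averaging twice is exactly what deshrinking corrects.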

Deshrinking

By taking averages twice, the range of predicted values is smaller than the observed range

Deshrinking regressions stretch the weighted averages back out to the observed range

Inverse and classical deshrinking regressions

  • inverse: regress gradient values on the WAs
  • classical: regress the WAs on gradient values
  • monotonic: as inverse, but using a monotonic spline
  • vegan also allows simply equalising the variances

Inverse and classical regression remove both bias and error; equalising the variances deshrinks without adjusting the bias
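A minimal sketch of the inverse deshrinking step (toy numbers; analogue and rioja do this internally):

```r
## Sketch of inverse deshrinking: regress observed gradient values on the
## raw weighted averages, then use the fitted line to stretch them back out
pH     <- c(5.0, 5.5, 6.0, 6.5, 7.0)   # observed gradient
wa_raw <- c(5.6, 5.8, 6.0, 6.2, 6.4)   # shrunken raw WA estimates (toy)

ds <- lm(pH ~ wa_raw)    # inverse deshrinking regression
deshrunk <- fitted(ds)   # estimates stretched back to the observed range
## classical alternative: lm(wa_raw ~ pH), then invert the coefficients
```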

WA in analogue

analogue contains R code for fitting WA transfer functions and associated helper functions

wa_m <- wa(SumSST ~ ., data = ik, deshrink = "inverse")
wa_m

    Weighted Averaging Transfer Function

Call:
wa(formula = SumSST ~ ., data = ik, deshrink = "inverse") 

Deshrinking  : Inverse 
Tolerance DW : No 
No. samples  : 61 
No. species  : 27 

Performance:
     RMSE  R-squared  Avg. Bias  Max. Bias  
   2.0188     0.9173     0.0000    -3.8155  

WA diagnostics

op <- par(mfrow = c(1,2))
plot(wa_m)
par(op)

WA prediction

pred <- predict(wa_m, V12.122)
pred

    Weighted Averaging Predictions

Call:
predict(object = wa_m, newdata = V12.122) 

Deshrinking     : Inverse 
Crossvalidation : none 
Tolerance DW    : No 

Performance:
   RMSEP        R2  Avg.Bias  Max.Bias  
  2.0188    0.9173    0.0000   -3.8155  

Predictions:
      0      10      20      30      40      50      60      70      80      90 
26.8321 26.7870 26.5611 26.1722 26.1857 26.1670 25.9064 26.0574 26.2797 25.6723 
    100     110     120     130     140     150     160     170     180     190 
26.1054 25.6092 25.8379 25.7696 25.7891 26.0105 25.8400 26.1986 26.0054 26.4729 
    200     210     220     230     240     250     260     270     280     290 
26.4282 26.5318 26.7689 26.7812 26.8077 26.0786 26.4078 26.3981 26.1494 26.4148 
    300     310     320     330     340     350     360     370     380     390 
26.2799 25.8553 26.0269 25.3974 26.0271 26.2423 26.3020 26.7047 26.7140 26.2727 
    400     410     420     430     440     450     460     470     480     490 
25.4927 26.7538 26.6039 26.6019 26.1936 26.7939 26.7742 26.2152 25.4620 26.7682 
    500     510     520     530     540     550     560     570     580     590 
26.8107 26.2679 25.7851 25.8562 25.5992 25.0000 25.3488 25.3794 25.3995 26.5347 
    600     610     620     630     640     650     660     670     680     690 
26.1509 26.1765 26.1447 25.8472 26.3835 26.3507 26.0932 24.5383 25.3052 26.6331 
    700     710     720     730     740     750     760     770     780     790 
26.3173 26.4848 26.0882 26.1193 26.1579 26.0043 26.3400 26.6920 26.9768 26.9926 
    800     810     820     830     840     850     860     870     880     890 
26.8074 26.4448 25.4736 25.8549 26.0450 26.2881 25.6021 26.1688 25.8223 24.1910 
    900     910     920     930     940     950     960     970     980     990 
24.4447 24.9817 25.4642 26.2359 26.4497 26.2772 26.1387 26.1874 25.8485 25.7372 
   1000    1010    1020    1030    1040    1050    1060    1070    1080    1090 
25.8538 24.8725 24.1065 24.4843 24.1864 25.6200 25.1869 24.8619 26.0186 25.6395 

Plot predictions

reconPlot(pred, use.labels = TRUE, ylab = "SumSST", xlab = "Depth")

Tolerance down-weighting?

We might expect those taxa with a narrow (realised) niche to be better indicators of \(\mathbf{x}\) than those taxa with a wide niche

Hence development of tolerance down-weighting in WA

Basically, weight taxa inversely in proportion to their estimated tolerance width

Sounds simple, but isn’t — could give infinite weight to a taxon found in a single training set sample!

Tolerance down-weighting?

If done correctly it produces very competitive models

Taxa with small tolerances get their tolerance replaced by

  1. minimum (non-small) tolerance in the training set
  2. average tolerance width
  3. a fraction of the observed gradient (say 10% of SST gradient observed)
  4. the threshold for defining a small tolerance

Also use Hill’s N2 when computing tolerance widths to ensure they are unbiased
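The idea can be sketched as follows (hypothetical numbers; the floor at 10% of the observed gradient is option 3 above):

```r
## Sketch: tolerance down-weighted WA prediction for one fossil sample.
## Tolerances are abundance-weighted SDs of the gradient about each optimum.
Y   <- matrix(c(10, 0, 5,   2, 8, 0,   0, 6, 10), nrow = 3, byrow = TRUE)
pH  <- c(5, 6, 7)
opt <- colSums(Y * pH) / colSums(Y)
tol <- sqrt(colSums(Y * outer(pH, opt, "-")^2) / colSums(Y))

## Guard against tiny tolerances: floor at 10% of the observed gradient
tol <- pmax(tol, 0.1 * diff(range(pH)))

y0  <- c(4, 3, 1)                          # a fossil assemblage
x0  <- sum(y0 * opt / tol^2) / sum(y0 / tol^2)   # weight taxa by 1/tol^2
```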

MAT

Modern Analogue Technique

WA takes a species approach to reconstruction — each species in the fossil sample that is also in the training set contributes to the reconstructed values

MAT takes a more holistic approach — we predict on the basis of similar assemblages

In MAT, only the most similar assemblages contribute to the fitted values

MAT is steeped in the tradition of uniformitarianism: the present is the key to the past

We take as our prediction of the environment of the past, the (possibly weighted) average of the environment of the \(k\) sites with the most similar assemblages

Several things to define: \(k\), and the (dis)similarity measure

MAT is \(k\) nearest neighbours (\(k\)-NN) regression/calibration
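A bare-bones sketch of MAT as \(k\)-NN, using squared-chord distance (a common choice for proportional assemblage data; the function names here are made up):

```r
## Sketch: MAT as k nearest neighbours with squared-chord dissimilarity
sq_chord <- function(a, b) sum((sqrt(a) - sqrt(b))^2)

mat_predict <- function(Y, env, y0, k = 3) {
  d <- apply(Y, 1, sq_chord, b = y0)   # dissimilarity to every training sample
  mean(env[order(d)[seq_len(k)]])      # mean environment of the k closest
}

## Tiny invented example: species proportions in rows
Y   <- rbind(c(0.50, 0.50, 0.00),
             c(0.00, 0.50, 0.50),
             c(0.25, 0.50, 0.25))
env <- c(10, 20, 15)
mat_predict(Y, env, y0 = c(0.5, 0.5, 0), k = 1)   # nearest is sample 1 -> 10
```

A weighted version would average `env` with weights such as `1/d`; analogue and rioja offer both.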

Modern Analogue Technique

If you want to fit MAT models, my J. Stat. Soft. paper has fully-working examples to follow

Simpson, G.L., 2007. Analogue Methods in Palaeoecology: Using the analogue Package. J. Stat. Softw. 22, 1–29.

Or use rioja::MAT(), especially if you want h-block CV

rioja

analogue is getting a little out-dated now — it needs some love

rioja, by Steve Juggins, is very similar; it provides more functionality (WA-PLS, maximum likelihood TFs, …) and, most importantly, h-block cross-validation!

analogue has better support for the Modern Analogue Technique

library("rioja")
wa_r <- WA(y = ik, x = SumSST)
wa_r

Method : Weighted Averaging
Call   : WA(y = ik, x = SumSST) 

Tolerance DW       : No 
Monotonic deshrink : No 
No. samples        : 61 
No. species        : 27 
Cross val.         : none 

Deshrinking regression coefficients:
      Inverse d/s Classical d/s
wa.b0     -5.6876        5.8905
wa.b1      1.2659        0.7246

Performance:
          RMSE      R2  Avg.Bias  Max.Bias    Skill
WA.inv  2.0188  0.9173         0    3.8155  91.7299
WA.cla  2.1078  0.9173         0    3.1324  90.9842

WA-PLS

Principal Component Regression — decompose \(\mathbf{Y}\) into PCA axes and use the first \(l\) of these axes to predict \(\mathbf{x}\): analogue::pcr()

Partial Least Squares is similar — find orthogonal components in \(\mathbf{Y}\) that are most correlated with \(\mathbf{x}\), and use the first \(l\) to predict \(\mathbf{x}\)

WA-PLS is a non-linear version of PLS — use row & column sums to turn a linear method into a non-linear one, just like CCA does: rioja::WAPLS()

An important aspect of fitting these models is to choose \(l\) — the number of components to retain

Model performance

Bias

Bias is the tendency for the model to over- or under-predict

Average bias is the mean of the residuals

Maximum bias is found by breaking the range of the measured environment into \(n\) contiguous chunks (\(n = 10\) usually)

Within each chunk calculate the mean of the residuals for that chunk

Take the largest (in absolute value) of these chunk means as the maximum bias statistic
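The computation is simple to sketch; this returns the signed chunk mean with the largest magnitude (helper name made up):

```r
## Sketch: maximum bias from residuals, using n contiguous chunks of the
## measured gradient (n = 10 by default)
max_bias <- function(obs, resid, n = 10) {
  chunk <- cut(obs, breaks = n)                      # n contiguous chunks
  means <- tapply(resid, chunk, mean, na.rm = TRUE)  # mean residual per chunk
  means[which.max(abs(means))]                       # largest-magnitude mean
}
```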

Bias

op <- par(mfrow = c(1,2))
plot(wa_m)
par(op)

Crossvalidation

Without cross-validation, prediction errors, measured by RMSEP, will be biased, often badly so, because we use the same data to both fit & test the model

Ideally we’d have such a large training set that we can split this into a slightly smaller training set and a small test set

Palaeoecological data are expensive to obtain — in money and person-hours!

Also these ecosystems are complex, species rich, noisy etc., so we want to use all our data to produce a model

One solution to this problem is to use cross-validation

General idea: perturb the training set in some way, build a new model on the perturbed training set and assess how well it performs

If we repeat the perturbation several times we get an idea of the error in the model

Several techniques: \(n\)-fold, leave-one-out (LOO), bootstrapping, h-block CV

Crossvalidation in analogue

In analogue, several methods are available

For MAT models, LOO is built into the procedure so only bootstrapping is available

For WA models, both LOO and bootstrapping currently available

\(n\)-fold CV & h-block will be available in a future version (!)

Leave-one-out CV

LOO CV is very simple

In turn, leave out each sample from the training set

Build a model on the remaining samples

Predict for the left out sample

Calculate the RMSEP of these predictions
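The recipe above as a generic sketch, where `fit` and `pred` are hypothetical stand-ins for any transfer-function method:

```r
## Generic LOO sketch: `fit` builds a model from training rows, `pred`
## predicts for one held-out row; both are placeholders for any TF method
loo_rmsep <- function(Y, env, fit, pred) {
  out <- vapply(seq_len(nrow(Y)), function(i) {
    m <- fit(Y[-i, , drop = FALSE], env[-i])   # model without sample i
    pred(m, Y[i, , drop = FALSE])              # predict the left-out sample
  }, numeric(1))
  sqrt(mean((env - out)^2))                    # RMSEP of the LOO predictions
}
```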

Leave-one-out CV

loo_pred <- predict(wa_m, V12.122, CV = "LOO", verbose = TRUE)
Leave one out sample 10 
Leave one out sample 20 
Leave one out sample 30 
Leave one out sample 40 
Leave one out sample 50 
Leave one out sample 60 
analogue::performance(wa_m)
    RMSE       R2 Avg.Bias Max.Bias 
   2.019    0.917    0.000   -3.815 
analogue::performance(loo_pred)
   RMSEP       R2 Avg.Bias Max.Bias 
   2.218    0.900   -0.014   -4.599 

Leave-one-out CV

rioja::crossval(wa_r, verbose = FALSE)

Method : Weighted Averaging
Call   : WA(y = ik, x = SumSST) 

Tolerance DW       : No 
Monotonic deshrink : No 
No. samples        : 61 
No. species        : 27 
Cross val.         : loo 

Deshrinking regression coefficients:
      Inverse d/s Classical d/s
wa.b0     -5.6876        5.8905
wa.b1      1.2659        0.7246

Performance:
               RMSE      R2  Avg.Bias  Max.Bias    Skill
WA.inv       2.0188  0.9173    0.0000    3.8155  91.7299
WA.cla       2.1078  0.9173    0.0000    3.1324  90.9842
WA.inv_XVal  2.2179  0.9003   -0.0137    4.5985  90.0176
WA.cla_XVal  2.3073  0.9018   -0.0067    3.9326  89.1969

n-fold CV

n-fold CV or leave-group-out CV

Same as LOO but instead of leaving out one sample at a time, we leave out an entire group of samples

n is usually 10

h-block CV

h-block CV very similar to LOO & \(n\)-fold

Instead of breaking the training set into n groups and leaving one out at a time, we leave out all observations within distance \(h\) of the target sample

Repeat for each sample in turn (like LOO)

Useful when training data are autocorrelated — as they often are in marine settings

Use rioja::crossval() to do h-block CV

Bootstrapping I

Bootstrapping is used in machine learning to improve predictions

Use bootstrapping to get more realistic RMSEP and bias statistics

We draw a bootstrap sample (sampling with replacement) of the same size as our training set

Build a model on the bootstrap samples

Predict for the out-of-bag (OOB) samples

Bootstrapping II

The bootstrap prediction for each training sample is the mean of the OOB predictions for that sample

Calculate the residuals and then the RMSEP

\[\mathrm{RMSEP_{boot}} = \sqrt{s_1^2 + s_2^2}\]

\(s_1\) is the standard deviation of the OOB residuals

\(s_2\) is the mean of the OOB residuals (the average bias)

We can also calculate the more usual RMSEP \(\sqrt{\sum_{i=1}^n (y_i - \hat{y}_i)^2 / n}\)
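As a sketch, given a vector of OOB residuals (observed minus mean OOB prediction, one per training sample; function name made up):

```r
## Sketch: combine the two bootstrap error components
## s1 = standard deviation of the OOB residuals, s2 = their mean (avg. bias)
boot_rmsep <- function(oob_resid) {
  s1 <- sd(oob_resid)
  s2 <- mean(oob_resid)
  sqrt(s1^2 + s2^2)
}
```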

Bootstrapping II

set.seed(1234)
sst_boot <- bootstrap(wa_m, n.boot = 999, verbose = FALSE)
sst_boot

    Weighted Averaging Bootstrap Predictions

Call:
bootstrap(object = wa_m, n.boot = 999, verbose = FALSE) 

Deshrinking     : Inverse 
Crossvalidation : bootstrap 
Tolerance DW    : No 

Performance:
   RMSEP        R2  Avg.Bias  Max.Bias  
  2.2919    0.9012   -0.0298   -4.6312  

Training set predictions:
  V14.61  V17.196  V18.110  V16.227   V14.47   V23.22    V2.12   V23.29 
  4.1761   3.8285   4.0208   3.9927   8.4033   9.2821   3.0228  14.6312 
  V12.43     R9.7   A157.3   V23.81   V23.82   V12.53   V23.83   V12.56 
 14.5555  17.0486  15.9813  19.0779  18.5811  18.7827  17.5376  20.5445 
 A152.84   V16.50  V22.122   V16.41    V4.32   V12.66  V19.245     V4.8 
 19.9792  19.8121  18.7319  22.9201  22.4598  20.7307  22.4718  22.0932 
 A180.15   V18.34  V20.213  V19.222  A180.39  V16.189   V12.18    V7.67 
 21.4827  23.3740  23.3249  22.8605  24.2542  25.6955  25.5307  23.2759 
 V17.165  V19.310  V16.190 A153.154  V19.308  V22.172   V10.98  V22.219 
 23.6758  23.0183  24.5248  25.3909  25.8264  26.3325  24.0536  25.4655 
  V16.33  V22.204  V20.167   V10.89   V12.79  V19.216   V14.90  A180.72 
 26.3604  25.8272  26.7432  26.4347  26.0909  25.7002  25.8305  26.3170 
  V16.21  A180.76  V15.164  A180.78    V14.5   V3.128  A179.13    V9.31 
 26.7791  26.6893  26.8161  25.9479  26.8964  26.7957  26.3988  26.0191 
 V20.230    V20.7  V20.234   V18.21  V12.122 
 26.6089  27.2027  26.7317  26.9673  26.8097 

Tuning?

With PCR, PLS, WA-PLS, and MAT we face an extra challenge

How to tune the model’s complexity parameter: \(l\) or \(k\)

How many components should we use? How many analogues?

Technically, we will underestimate RMSEP if we use CV to tune the model

We need a tuning test set

Tuning?

Form a test set from the training set (representative, evenly spaced) — set aside

Form a tuning or optimisation set from the training set (as above) — set aside

Build your TF using the remaining samples

  1. Use CV to generate RMSEP for the tuning set samples
  2. Find the value of \(l\) or \(k\) that gives the lowest RMSEP on the tuning set
  3. Use CV to generate RMSEP for the test set using the selected value of \(l\) or \(k\)

You could also use a nested CV design — nothing readily available for analogue or rioja models, though we should look at this!

Close analogues

A measure of reliability for the reconstructed values can be determined from the distance between each fossil sample and the training set samples

For a reconstructed value to be viewed as more reliable, it should have at least one close modern analogue in the training set

Close modern analogues are defined as those modern training set samples that are as similar to a fossil sample as a low percentile of the observed distribution of dissimilarities in the training set, say the 5\(^{th}\) percentile
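A sketch of the rule (function and argument names are made up; analogue’s minDC() and related helpers implement this properly):

```r
## Sketch: flag fossil samples lacking a close modern analogue, using a low
## percentile of the pairwise training-set dissimilarities as the cutoff
has_close_analogue <- function(train_diss, fossil_min_diss, p = 0.05) {
  ## train_diss: all pairwise training dissimilarities (e.g. from dist())
  ## fossil_min_diss: each fossil sample's minimum dissimilarity to training
  cutoff <- unname(quantile(as.vector(train_diss), probs = p))
  fossil_min_diss <= cutoff
}
```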

Close analogues

Which samples in the core have a close match in the training set?

v12_mdc <- minDC(wa_m, V12.122 / 100, method = "chord")
plot(v12_mdc, use.labels = TRUE, xlab = "Depth")

Sample-specific errors

We can use the bootstrap approach to generate sample specific errors for each fossil sample

\[\mathrm{RMSEP} = \sqrt{s^2_{1_{fossil}} + s^2_{2_{model}}}\]

\(s_{1_{fossil}}\) is the standard deviation of the bootstrap estimates for each fossil sample

\(s_{2_{model}}\) is the average bias, the mean of the bootstrap OOB residuals from the model

Sample-specific errors

set.seed(1234)
v12_boot <- predict(wa_m, V12.122 / 100, bootstrap = TRUE, n.boot = 999)
reconPlot(v12_boot, use.labels = TRUE, ylab = "Summer SST", xlab = "Depth", display.error = "bars")

Sick science?

Juggins (Quaternary Science Reviews, 2013)

Are transfer functions too good to be true?

\(\mathbf{x}\) must have biological/physiological control on the taxa

Secondary nuisance variables are (very) problematic — they can bias the estimated species–environment relationships

You could be reconstructing variation in something else!

Variation downcore that has nothing to do with \(\mathbf{x}\) will show up as variation in \(\mathbf{\hat{x}_0}\)

SWAP

SWAP Example

Work through the SWAP example